Data obtained from scraping The Athenaeum.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
from PIL import Image
import re
We collected data for artists and for works (a.k.a. paintings). Their original state is as follows.
authors = pd.read_csv('data/athenaeum_authors.csv')
authors.sample(5)
authors.describe()
We can see that every author has a birth year, but not all have a death year (understandably). For that reason, elements in that column show up as float. We also find some inconsistent values, such as the minimum death year being smaller than the minimum birth year, or a birth year of 2669.
Given that we're not using those values in our models, and that the paintings themselves have a year assigned to them, we chose to ignore these fields.
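The float dtype noted above is a pandas quirk worth spelling out: an integer column that contains missing values is silently upcast to float, because NaN is a floating-point value. A minimal illustration with made-up years (not the actual data):

```python
import numpy as np
import pandas as pd

# Toy frame: one artist has no death year, so death_year holds NaN.
years = pd.DataFrame({'birth_year': [1606, 1632],
                      'death_year': [1669, np.nan]})
# NaN is a float, so the whole death_year column is upcast to float64,
# while birth_year (no missing values) stays an integer column.
print(years.dtypes)
```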
authors.describe(include = [object])
We find that not all authors have a known first name, and the same goes for the last name if we take into account that 43 of them are simply Unknown. Only 6.6% of them have a biography on the website we scraped, and only 16.6% have an assigned art movement.
We can further look into how the Unknown last name does with first names.
authors.loc[authors['last_name'] == 'Unknown', 'first_name'].value_counts()
As for the works table, its general aspect follows. We omit the image_out, height_uom and width_uom columns because they carry no information (they're either nan or 'cm').
paintings = pd.read_csv('data/athenaeum_paintings.csv').drop(['image_out', 'height_uom', 'width_uom'], axis = 1)
paintings.sample(5)
paintings.describe()
We have records of 207,353 art works. Here, height and width refer to the original height and width of the art pieces in centimeters.
paintings.describe(include = [object])
One thing that immediately stands out is that 3 painting URLs are repeated.
paintings[paintings['painting_url'].duplicated(keep = False)]
These conflicts were likely created by the scraper running several times while the website database changed, recycling the IDs of some entities. To be clear, not only authors and articles have an ID: every image has one too, and image IDs generally bear no relationship to the former.
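One simple way to resolve such conflicts, assuming the most recently scraped row reflects the current state of the website, is to keep only the last occurrence of each URL. A sketch on a toy stand-in for the paintings table (not the real data):

```python
import pandas as pd

# Toy stand-in: three rows, with one recycled painting_url.
paintings_toy = pd.DataFrame({'painting_url': ['a.jpg', 'b.jpg', 'a.jpg'],
                              'painting_id': [1, 2, 3]})
# Keep only the last occurrence of each URL; earlier rows are assumed
# to be stale entries from a previous scraper run.
deduped = paintings_toy.drop_duplicates(subset='painting_url', keep='last')
print(list(deduped['painting_id']))  # [2, 3]
```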
Another important part of our data is the set of images. We scraped the full-sized versions while simultaneously making a resized 200x200 copy. Unfortunately, mostly due to server errors, not all of the works downloaded successfully. The painting_sizes table is in essence the same as the paintings table, filtered down to the artworks with a downloaded image.
painting_sizes = pd.read_csv('data/athenaeum_paintings_sizes.csv')
painting_sizes[['height_px', 'width_px']].describe()
Out of the 207,353 artworks, 108 failed to download.
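That figure comes directly from comparing the row counts of the two tables. A sketch with toy frames standing in for paintings and painting_sizes (the real tables have 207,353 and 207,245 rows):

```python
import pandas as pd

# Toy stand-ins: five scraped works, four downloaded images.
paintings_toy = pd.DataFrame({'painting_id': [1, 2, 3, 4, 5]})
sizes_toy = pd.DataFrame({'painting_id': [1, 2, 3, 5]})

# Failed downloads = works present in paintings but absent from
# painting_sizes, i.e. the difference in row counts.
failed = len(paintings_toy) - len(sizes_toy)
print(failed)  # 1
```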
painting_sizes[['height_px', 'width_px']].plot.kde()
plt.xlim(100, 3000)
t = plt.title('Distribution of images dimensions in pixels')
At first glance, it seems like the majority of the images are horizontal. To verify that claim, we look at the (log) ratio between their height and width.
hwlogratio = np.log2(painting_sizes['height_px']) - np.log2(painting_sizes['width_px'])
hwlogratio.describe()
hwlogratio.plot.kde()
plt.title('Distribution of Log-Ratio between Height and Width of images')
t = plt.xlim(-2, 2)
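Reading the log-ratio: since log2(h) - log2(w) = log2(h/w), a value of 0 means a square image, negative values mean wider than tall, and each unit corresponds to a factor of two. For example (made-up dimensions):

```python
import numpy as np

# log2(height) - log2(width) equals log2(height / width):
# 0 for a square image, negative when the image is wider than tall.
wide = np.log2(200) - np.log2(400)  # a 200x400 (h x w) landscape image -> -1
tall = np.log2(400) - np.log2(200)  # a 400x200 portrait image -> +1
print(wide, tall)
```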
Let's now take a look at the types of artworks we have.
painting_sizes['article_type'].value_counts()[::-1].plot.barh()
plt.title('Artwork counts by type')
plt.xscale('log')
t = plt.xlabel('Quantity')
We can also look for image samples of each type.
def show_images_for_type(article_type):
    # Show a 2x4 grid of 8 randomly sampled images of the given type.
    sample_painting = paintings[paintings['article_type'] == article_type].sample(8)
    f, ax = plt.subplots(2, 4, figsize = (18, 9))
    for i in range(8):
        im = Image.open('data/images_athenaeum/full/%d/%d.jpg' % (sample_painting.iloc[i]['author_id'],
                                                                  sample_painting.iloc[i]['painting_id']))
        curAxis = ax[i // 4, i % 4]  # integer division keeps the index valid in Python 3
        curAxis.imshow(im)
        curAxis.set_xticks([])
        curAxis.set_yticks([])
show_images_for_type('Sculpture')
show_images_for_type('Stained glass')
show_images_for_type('Print')
show_images_for_type('Assemblage')
show_images_for_type('Collage')
show_images_for_type('Mixed media')
show_images_for_type('Etching')
show_images_for_type('Drawing')
show_images_for_type('Painting')
show_images_for_type('Unknown')
show_images_for_type('Engraving')
We ultimately decided to remove the Collage, Stained glass and Sculpture types.
paintings_filtered = painting_sizes[~painting_sizes['article_type'].isin(['Collage', 'Stained glass', 'Sculpture'])]
authors = authors.merge(paintings_filtered.groupby('author_id').aggregate({'painting_id': 'count'})\
.rename(columns = {'painting_id': 'num_paintings'}).reset_index(), how = 'inner',
on = 'author_id')
authors.set_index('last_name')['num_paintings'].nlargest(20)[::-1].plot.barh()
plt.xlabel('Number of paintings')
plt.ylabel('Author')
t = plt.title('Number of paintings from top 20 authors')
def text_plot(x, y, s, **kwargs):
    # Place the (ASCII-sanitized) label of the first row at the point.
    plt.text(x, y, s.iloc[0].encode('ascii', 'ignore').decode('ascii'), **kwargs)
fg = sns.FacetGrid(data=authors.groupby('nationality').aggregate({'author_id': 'count', 'num_paintings': 'sum'})\
.rename(columns = {'author_id': 'num_authors'}).applymap(np.log2).nlargest(40, 'num_paintings').reset_index(),
hue='nationality', size = 8, aspect = 2)
fg = fg.map(plt.scatter, 'num_authors', 'num_paintings')\
.map(text_plot, 'num_authors', 'num_paintings', 'nationality', fontsize = 12)
plt.title('Number of paintings and authors per nationality')
plt.xlabel('Number of authors (log2)')
t = plt.ylabel('Number of paintings (log2)')
authors['art_movement'] = authors['art_movement']\
    .apply(lambda x: x.encode('ascii', 'ignore').decode('ascii') if pd.notna(x) else 'Unknown')
fg = sns.FacetGrid(data=authors.groupby('art_movement').aggregate({'author_id': 'count', 'num_paintings': 'sum'})\
.rename(columns = {'author_id': 'num_authors'}).applymap(np.log2).reset_index(),
hue='art_movement', size = 8, aspect = 2)
fg = fg.map(plt.scatter, 'num_authors', 'num_paintings')\
.map(text_plot, 'num_authors', 'num_paintings', 'art_movement', fontsize = 12)
plt.title('Number of paintings and authors per art movement')
plt.xlabel('Number of authors (log2)')
t = plt.ylabel('Number of paintings (log2)')
On the website's page on Art Movements, we find that some of them can be aggregated into larger categories. To help visualization, and to make a predictor's job more attainable, we instead consider these larger categories of art movements.
art_movement_conversor_key = {'Nazarene': 'Romantic',
'Abstraction-Cration': 'Abstract', #converted to ASCII
'High Renaissance': 'Renaissance',
'Futurist': 'Expressionist',
'Bauhaus': 'Expressionist',
'De Stijl': 'Abstract',
'Fauvist': 'Expressionist',
'Early Renaissance': 'Renaissance',
'Suprematist': 'Abstract',
'Pointilist': 'Post-Impressionist',
'Mannerism': 'Renaissance',
'Caravaggisti': 'Baroque',
'Nabi': 'Post-Impressionist',
'Skagen': 'Impressionist',
'Northern Renaissance': 'Renaissance',
'Old Lyme Colony': 'Impressionist',
'Barbizon': 'Realist',
'Peredvizhniki': 'Realist',
'Hudson River School': 'Realist',
'Dutch Golden Age': 'Baroque'}
def convert_art_movement(movement):
    # Movements without an entry in the key map to themselves.
    return art_movement_conversor_key.get(movement, movement)
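A quick sanity check of the mapping, using a self-contained excerpt of the key above so the cell runs on its own:

```python
# Excerpt of the full conversion key defined above.
art_movement_conversor_key = {'Bauhaus': 'Expressionist',
                              'De Stijl': 'Abstract'}

def convert_art_movement(movement):
    # Movements without an entry pass through unchanged.
    return art_movement_conversor_key.get(movement, movement)

print(convert_art_movement('Bauhaus'))  # Expressionist
print(convert_art_movement('Baroque'))  # Baroque (no entry, passes through)
```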
authors['sup_art_movement'] = authors['art_movement'].apply(convert_art_movement)
fg = sns.FacetGrid(data=authors.groupby('sup_art_movement').aggregate({'author_id': 'count', 'num_paintings': 'sum'})\
.rename(columns = {'author_id': 'num_authors'}).applymap(np.log2).reset_index(),
hue='sup_art_movement', size = 8, aspect = 2)
fg = fg.map(plt.scatter, 'num_authors', 'num_paintings')\
.map(text_plot, 'num_authors', 'num_paintings', 'sup_art_movement', fontsize = 12)
plt.title('Number of paintings and authors per art movement')
plt.xlabel('Number of authors (log2)')
t = plt.ylabel('Number of paintings (log2)')
def convert_date(x):
    # Removes the 'circa' prefix and converts 'date unknown' to None;
    # ranges like '1880-1885' are averaged.
    x = re.match(r'(?:circa )?(\d+-?\d*)?', str(x)).group(1)
    if x is None:
        return None
    parts = [int(p) for p in x.split('-')]  # a list, so len() works in Python 3
    result = sum(parts) / len(parts)
    return result if 1000 <= result <= 2017 else None
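To see what the parser does on representative values, here is a self-contained copy of the same logic run on a few sample strings:

```python
import re

def convert_date(x):
    # Strip an optional 'circa ' prefix; average ranges like '1880-1885';
    # anything without digits (e.g. 'date unknown') becomes None.
    x = re.match(r'(?:circa )?(\d+-?\d*)?', str(x)).group(1)
    if x is None:
        return None
    parts = [int(p) for p in x.split('-')]
    result = sum(parts) / len(parts)
    return result if 1000 <= result <= 2017 else None

print(convert_date('circa 1880'))    # 1880.0
print(convert_date('1880-1890'))     # 1885.0
print(convert_date('date unknown'))  # None
```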
to_plot = paintings_filtered['painting_dates'].apply(convert_date)
f, ax = plt.subplots(1, 2, figsize = (18, 8))
plt.subplot(121)
to_plot.plot.hist(bins = 40)
plt.xlabel('Year')
plt.title('Paintings over time')
plt.subplot(122)
to_plot.plot.hist(bins = 40)
plt.yscale('log')
plt.ylabel('Frequency (log)')
plt.xlabel('Year')
t = plt.title('Paintings over time (log scale)')
fig = plt.figure(figsize = (10, 10))
paintings_filtered['painting_location'].value_counts()[32::-1].plot.barh()
plt.xlim(0, 3000)
t = plt.title('Locations with the most paintings')
plt.figure(figsize=(12, 6))
paintings_filtered['medium'].value_counts()[20::-1].plot.barh()
plt.xscale('log')
plt.xlabel('Frequency')
t = plt.title('Most frequent art media')